There are 113,937 observations with 81 variables to analyze. Variables include basic loan information and applicant information, to detailed description of the loan backer and financial institution information. Most variables are numerica with String variables as factor variables.
There are many different case studies to analyze, from trends of lower income to upper income customers, to analyzing what products customers will seek to finance, to the credit worthiness of the borrower and / or the loan.
We want to analyze the various Loan Status’ and see which variables seem to impact and help identify trends among a borrower’s loan. Below are some initial data investigations.
head(df)
names(df)
print(summary(na.omit(df)))
#Borrower Profile
summary((df$CreditScoreRangeLower + df$CreditScoreRangeLower)/2)
table(df$OpenCreditLines)
summary(df$IncomeRange)
summary(df$DebtToIncomeRatio)
table(df$TotalInquiries)
table(df$EmploymentStatus)
table(df$IsBorrowerHomeowner)
#Current Loan Data
count(df$LoanStatus)
summary(df$LoanCurrentDaysDelinquent)
table(df$CreditGrade)
summary(df$MonthlyLoanPayment)
Make preliminary changes to data set
#Calculate the mean credit score (upper - lower)
df$CreditScoreMean = (df$CreditScoreRangeUpper + df$CreditScoreRangeLower)/2
#Factor the credit grade
df$CreditGrade = factor(df$CreditGrade, levels = c('NC', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA'))
#Factor income levels
df$IncomeRange = factor(df$IncomeRange,
levels = c('$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999',
'$75,000-99,999', '$100,000+', 'Not displayed', 'Not employed'))
#Define loan status category summary: good-standing or bad-standing or late
df$LoanStatusBinary =
ifelse(df$LoanStatus=='Current' | df$LoanStatus=='Completed' | df$LoanStatus=='FinalPaymentInProgress', 'Good',
ifelse(df$LoanStatus=='Cancelled' | df$LoanStatus=='Chargedoff' | df$LoanStatus=='Defaulted','Bad',
'Late'))
#Income in annual terms
df$StatedAnnualIncome = df$StatedMonthlyIncome * 12
#Define variable analysis as array
borrowerProfile = c( "CreditScoreMean" , "OpenCreditLines" , "IncomeRange" ,
"DebtToIncomeRatio" , "TotalInquiries" , "EmploymentStatus" , "IsBorrowerHomeowner" )
loanData = c( "LoanStatus" , "LoanCurrentDaysDelinquent" , "CreditGrade" , "MonthlyLoanPayment")
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Warning: Removed 22 rows containing non-finite values (stat_boxplot).
## Warning: Removed 22 rows containing non-finite values (stat_boxplot).
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 16965 58308 7433 7189 2395 756 2572 10494 199 85 91 217
## 12 13 14 15 16 17 18 19 20
## 59 1996 876 1522 304 52 885 768 771
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Saving 7 x 5 in image
## Warning: Removed 3007 rows containing non-finite values (stat_bin).
## Warning: Removed 3007 rows containing non-finite values (stat_bin).
## Saving 7 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7604 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7604 rows containing non-finite values (stat_bin).
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Warning: Removed 1005 rows containing non-finite values (stat_bin).
## Warning: Removed 1005 rows containing non-finite values (stat_bin).
## Warning: Removed 8554 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9353 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 113396 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 8554 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9353 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 113396 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
I am interested in investigating any correlation to the loan status. I chose a handful of variables relative to the loan’s financial information and a handful of variables relative to the borrower’s profile.
I created one new variable to analyze the borrowers credit score. The dataset included an upper and lower range of the credit score; I simplified this into one number - the mean credit score.
There were some adjustments to the dataset, including empty values. There appears to be some obvious distributions that I would like to analyze.
Going forward, I want to analyze loans that are in good condition (completed or current) and loans that are in bad condition (defaulted or charged off). From here, we can look for some important indicators to anticipate and predict the result of loans that are past due.
## Saving 7 x 5 in image
## Warning: Removed 46134 rows containing missing values (geom_point).
## Warning: Removed 46311 rows containing missing values (geom_point).
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Warning: Removed 46435 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_vline).
## Warning: Removed 46529 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_vline).
## Saving 7 x 5 in image
## Warning: Removed 47238 rows containing missing values (geom_point).
## Warning: Removed 47466 rows containing missing values (geom_point).
## Saving 7 x 5 in image
## Warning: Removed 584 rows containing missing values (geom_point).
## Warning: Removed 584 rows containing missing values (geom_point).
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Warning: Removed 38184 rows containing missing values (geom_point).
## Warning: Removed 38218 rows containing missing values (geom_point).
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Saving 7 x 5 in image
Not suprisingly, higher income persons had better cedit scores and were more likely to stay in good standing on their monthly payment.
The median loan amount across all categories is about $6800, with the highest median categories being debt consolidation, business, baby & adoption, boats, and weddings. As discovered in the univariate analysis, debt consolidation by far is the largest category of loans.
The higher the loan amount, the more likely that the loan will be in good standing. This may be intuitive as only higher income persons would qualify for more expensive loans.
Most loans seems to follow a somewhat normal distribution among credit scores. Popular loan types among those with higher credit scores used debt to finance their business, home improvements, and debt consolidation. And, although accounting a smaller number of loans, expensive items such as engagement rings, baby and adoption, large purchases, and wedding loans
Credit scores, credit grades, and income all tell a similar story. However, there is much variance across income. Generally speaking, higher income earners have higher credit scores, and therefore their loans are of higher grade.
## Saving 7 x 5 in image
## Warning: Removed 1721 rows containing missing values (geom_point).
## Warning: Removed 1740 rows containing missing values (geom_point).
## Saving 7 x 5 in image
## Warning: Removed 10412 rows containing missing values (geom_point).
## Warning: Removed 10412 rows containing missing values (geom_point).
## Saving 7 x 5 in image
## Warning: Removed 8969 rows containing missing values (geom_point).
## Warning: Removed 8969 rows containing missing values (geom_point).
## Warning: Removed 8969 rows containing missing values (geom_point).
## Saving 7 x 5 in image
## Warning: Removed 8969 rows containing missing values (geom_point).
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## Warning: Removed 1189 rows containing missing values (geom_point).
## Warning: Removed 1189 rows containing missing values (geom_point).
Smaller loans among lower income borrowers were more likely to be in a bad state, whereas higher loans among upper income borrowers were more likely to be in a good state.
Personal loans and Student use across all income types generated were the most likely to be in a bad loan state.
## Saving 7 x 5 in image
## Warning: Removed 1189 rows containing missing values (geom_point).
## Warning: Removed 1189 rows containing missing values (geom_point).
This visualization communicates that lower income borrowers can be risky investments when using for debt consolidation or personal loans.
## Saving 7 x 5 in image
## Warning: Removed 1736 rows containing missing values (geom_point).
## Warning: Removed 1722 rows containing missing values (geom_point).
This visualization tells us that higher income borrowers tend to borrow more money, but are more likely to remain in good standing.
## Saving 7 x 5 in image
## Warning: Removed 8969 rows containing missing values (geom_point).
## Warning: Removed 8969 rows containing missing values (geom_point).
This graph tells us that across income spectrums, the most risky loans are for debt consolidation, home improvements, business loans, and student use. Loans used for baby and adoption, cosmetic proecedures, engagement rings, and taxes are significantly likely to be in better standing.
There was a lot of data to analyze. It took some time to get a sense of what the dataset contains. Then I needed to make some adjustments so that I can work with it the way I want to. I added a few columns, refactored some variables, and had to identify null/empty or ambiguous terms.